Systems and methods for monitoring application health in a distributed architecture

ABSTRACT

A computing device configured for monitoring and analyzing health of a distributed computer system having a plurality of interconnected system components. The computing device tracks communication between the system components and monitors for an alert indicating an error in the communication in the distributed computer system. In response to the error, the computing device receives a health log from each of the system components, together defining an aggregate health log, each health log being in a standardized format indicating messages communicated between the system components. The computing device further receives network infrastructure information defining relationships between the system components and characterizing dependency information; and automatically determines, based on the aggregate health log and the network infrastructure information, a particular component originating the error and the associated dependent components affected.

CROSS REFERENCE TO RELATED APPLICATION

This application is a continuation of U.S. patent application Ser. No. 16/925,862, filed Jul. 10, 2020, and entitled “SYSTEMS AND METHODS FOR MONITORING APPLICATION HEALTH IN A DISTRIBUTED ARCHITECTURE”, the contents of which are incorporated herein by reference.

FIELD

The present disclosure generally relates to monitoring application health of interconnected application system components within a distributed architecture system. More particularly, the disclosure relates to a holistic system for automatically identifying a root source of one or more errors in the distributed architecture system for subsequent analysis.

BACKGROUND

In a highly distributed architecture, current error monitoring systems utilize monitoring rules to track only an individual component in the architecture (typically the output component interfacing with external components) and raise an alert based on the individual component being tracked indicating an error. Thus, the monitoring rules can trigger the alert based on the individual component's error log, but do not take into account the whole distributed architecture system; rather, they rely on developers to manually troubleshoot and determine where the error may have actually occurred, in a fragmented and error-prone manner. That is, current monitoring techniques involve error analysis that is performed haphazardly, by trial and error, and is heavily human-centric. This provides an unpredictable and fragmented analysis while consuming extensive manual time and cost to possibly determine a root cause which may not be accurate.

Thus, when there is an error at one of the system components, the analysis requires the support team to manually determine whether the error originated in the component which alerted the error or elsewhere in the system, which leads to uncertainty and may be infeasible due to the complexities of the distributed architecture.

In prior monitoring systems of distributed architectures, when an error occurs within the system, a system component (e.g. an API) directly associated with the user interface reporting the error may first be investigated, and then a manual and resource-intensive approach is performed to examine each and every system component to determine where the error would have originated.

Accordingly, there is a need to provide a method and system to facilitate automated and dynamic application health monitoring in distributed architecture systems with a view to the entire system, such as to obviate or mitigate at least some or all of the above presented disadvantages.

SUMMARY

It is an object of the disclosure to provide a computing device for improved holistic health monitoring of system components (e.g. API software components) in a multi-component system of a distributed architecture to determine a root cause of errors (e.g. operational issues or software defects). In some aspects, this includes proactively spotting error patterns in the distributed architecture and notifying parties. The proposed disclosure provides, in at least some aspects, a standardized mechanism of automatically determining one or more system components (e.g. an API) originating the error in the distributed architecture.

There is provided a computing device for monitoring and analyzing health of a distributed computer system having a plurality of interconnected system components, the computing device having a processor coupled to a memory, the memory storing instructions which when executed by the processor configure the computing device to: track communication between the system components and monitor for an alert indicating an error in the communication in the distributed computer system; upon detecting the error: receive a health log from each of the system components together defining an aggregate health log, each health log being in a standardized format indicating messages communicated between the system components; receive, from a data store, network infrastructure information defining one or more relationships for connectivity and communication flow between the system components, the relationships characterizing dependency information between the system components; and, automatically determine, based on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and associated dependent components from the system components affected. The standardized format may comprise a JSON format.

Each health log may further comprise a common identifier for tracing a route of the messages communicated for a transaction having the error.

The computing device may further be configured to obtain health monitoring rules comprising data integrity information for pre-defined communications between the system components from the data store, the health monitoring rules for verifying whether each of the health logs complies with the data integrity information.

The health monitoring rules may further be defined based on historical error patterns for the distributed computer system, associating a set of traffic flows for the messages between the system components, potentially occurring in each of the health logs, to a corresponding error type.

The computing device may further be configured to: determine, from the dependency information indicating which of the system components are dependent on one another for operations performed in the distributed computer system, an impact of the error originated by the particular component on the associated dependent components.

The computing device may further be configured to, upon detecting the alert, display the alert on a user interface of a client application for the device, the alert based on the particular component originating the error determined from the aggregate health log.

The computing device may further be configured for displaying on the user interface, along with the alert, the associated dependent components to the particular component.

The system components may be APIs (application programming interfaces) on one or more connected computing devices and the health log may be an API log for logging activity for the respective API in communication with other APIs and related to the error.

The processor may further configure the computing device to automatically determine origination of the error by: comparing each of the health logs in the aggregate health log to the other health logs in response to the relationships in the network infrastructure information.

There is provided a computer implemented method for monitoring and analyzing health of a distributed computer system having a plurality of interconnected system components, the method comprising: tracking communication between the system components and monitoring for an alert indicating an error in the communication in the distributed computer system; upon detecting the error: receiving a health log from each of the system components together defining an aggregate health log, each health log being in a standardized format indicating messages communicated between the system components; receiving, from a data store, network infrastructure information defining one or more relationships for connectivity and communication flow between the system components, the relationships characterizing dependency information between the system components; and, automatically determining, based on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and associated dependent components from the system components affected.

There is provided a computer readable medium comprising a non-transitory device storing instructions and data, which when executed by a processor of a computing device, the processor coupled to a memory, configure the computing device to: track communication between system components of a distributed computer system having a plurality of interconnected system components and monitor for an alert indicating an error in the communication in the distributed computer system; upon detecting the error: receive a health log from each of the system components together defining an aggregate health log, each health log being in a standardized format indicating messages communicated between the system components; receive, from a data store, network infrastructure information defining one or more relationships for connectivity and communication flow between the system components, the relationships characterizing dependency information between the system components; and, automatically determine, based on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and associated dependent components from the system components affected.

There is provided a computer program product comprising a non-transient storage device storing instructions that when executed by at least one processor of a computing device, configure the computing device to perform in accordance with the methods herein.

BRIEF DESCRIPTION OF THE DRAWINGS

These and other features of the disclosure will become more apparent from the following description in which reference is made to the appended drawings wherein:

FIG. 1 is a schematic block diagram of a computing system environment for providing automated application health monitoring and error origination analysis in accordance with one or more aspects of the present disclosure.

FIG. 2 is a schematic block diagram illustrating example components of a diagnostics server in FIG. 1, in accordance with one or more aspects of the present disclosure.

FIG. 3 is a flowchart illustrating example operations of the diagnostics server of FIG. 1, in accordance with one or more aspects of the present disclosure.

FIG. 4 is a schematic block diagram showing an example communication between the computing device comprising a plurality of interconnected system components A, B, and C and the diagnostics server components comprising the automatic analyzer and the data store in FIG. 1, in accordance with one or more aspects of the present disclosure.

FIG. 5 is a diagram illustrating example health logs received from different system components and actions taken at the automatic analyzer in accordance with one or more aspects of the present disclosure.

FIG. 6 is a diagram illustrating an example diagnostic results alert in accordance with one or more aspects of the present disclosure.

FIG. 7 is a diagram illustrating a typical flow of communication for the health monitoring performed by the automatic analyzer for providing deep diagnostic analytics in accordance with one or more aspects of the present disclosure.

DETAILED DESCRIPTION

FIG. 1 is a diagram illustrating an example computer network 100 in which a diagnostics server 108 is configured for providing unified deep diagnostics of distributed system components and particularly, error characterization analysis for the distributed components of one or more computing device(s) 102 communicating across a communication network 106. The diagnostics server 108 is configured to receive an aggregate health log including communication health logs 107 (individually 107A, 107B . . . 107N) from each of the system components, collectively shown as system components 104 (individually shown as system components 104A-104N) such as API components, in a standard format. The communication health logs 107 may be linked, for example, via a common key tracing identifier that may show that a particular transaction involved components A, B, and C and the types of events or messages communicated for the transaction, by way of example. In one example, the common identifier comprises key metadata that interconnects via an entity function role. In one case, if the messages communicated between components 104A-104N are financial transactions, then the common tracing identifier may link parties affecting a particular financial transaction. The common tracing identifier (e.g. traceability ID 506 in FIG. 5) may further be modified each time it is processed or otherwise passes through one of the components 104 to also facilitate identifying a path taken by a message when communicated between the components 104 during performance of a particular function (e.g. effecting a transaction).
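
By way of illustration only, the following minimal sketch (in Python, with all names hypothetical; the disclosure does not prescribe a concrete encoding) shows one way such a tracing identifier could be extended at each component hop so that the route of a message remains recoverable:

    # Hypothetical sketch: a tracing identifier that accumulates a hop at
    # each component, so the path A -> B -> C is recoverable from the ID.
    def extend_trace_id(trace_id: str, component: str) -> str:
        """Append the current component to the tracing identifier."""
        return f"{trace_id}.{component}"

    trace_id = "txn-481516"            # assigned when the transaction starts
    for component in ("A", "B", "C"):  # the message passes through A, B, C
        trace_id = extend_trace_id(trace_id, component)

    print(trace_id)                    # txn-481516.A.B.C
    path = trace_id.split(".")[1:]     # recovered route: ['A', 'B', 'C']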

The computing device(s) 102 each comprise at least a processor 103, a memory 105 (e.g. a storage device, etc.) and one or more distributed system components 104. The memory 105 stores instructions which, when executed by the computing device(s) 102, configure the computing device(s) 102 to perform operations described herein. The distributed system components 104 may be configured (e.g. via the instructions stored in the memory 105) to provide the distributed architecture system described herein for collaborating together to provide a common goal such as access to resources on the computing devices 102; or access to communication services provided by the computing device 102; or performing one or more tasks in a distributed manner such that the computing nodes work together to provide the desired task functionality. The distributed system components 104 may comprise distributed applications such as application programming interfaces (APIs), user interfaces, etc.

In some aspects, such a distributed architecture system provided by the computing device(s) 102 includes the components 104 being provided on different platforms (e.g. correspondingly different machines such that there are at least two computing devices 102 each containing some of the components 104) so that a plurality of the components (e.g. 104A . . . 104N) can cooperate with one another over the communication network 106 in order to achieve a specific objective or goal (e.g. completing a transaction or performing a financial trade). For example, the computing device(s) 102 may be one or more distributed servers for various functionalities such as provided in a trading platform. Another example of the distributed system provided by computing device(s) 102 may be a client/server model. In this aspect, no single computer in the system carries the entire load on system resources; rather, the collaborating computers (e.g. at least two computing devices 102) execute jobs in one or more remote locations.

In yet another aspect, the distributed architecture system provided by the computing device(s) 102 may be, more generally, a collection of autonomous computing elements (e.g. which may be either hardware devices and/or software processes such as system components 104) that appear to users as a single coherent system. Typically, the computing elements (e.g. either independent machines or independent software processes) collaborate together in such a way via a common communication network (e.g. network 106) to perform related tasks. Thus, the existence of multiple computing elements is transparent to the user in a distributed system.

Furthermore, as described herein, although a single computing device 102 is shown in FIG. 1 with distributed computing elements provided by system components 104A-104N which reside on the single computing device 102, alternatively, a plurality of computing devices 102 connected across the communication network 106 in the network 100 may be provided, with the components 104 spread across the computing devices 102 to collaborate and perform the distributed functionality via multiple computing devices 102.

The communications network 106 is thus coupled for communication with a plurality of computing devices. It is understood that communication network 106 is simplified for illustrative purposes. Communication network 106 may comprise additional networks coupled to a wide area network (WAN), such as a wireless network and/or local area network (LAN) between the WAN and the computing devices 102 and/or diagnostics server 108.

The diagnostics server 108 further retrieves network infrastructure information 111 for the system components 104 (e.g. may be stored on the data store 116, or directly provided from the computing devices 102 hosting the system components 104). The network infrastructure information 111 may characterize various types of relationships between the system components and/or communication connectivity information for the system components. For example, this may include dependency relationships, such as operational dependencies or communication dependencies between the system components 104A-104N, for determining the health of the system and tracing an error in the system to its source.

The operational dependencies may include, for example, whether a system component 104 needs to call upon or otherwise involve another component in order to perform system functionalities (e.g. performing a financial transaction may require component A to call upon functionalities of components B and N). The communication dependencies may include information about which components 104 are able to communicate with and/or receive information from one another (e.g. have wired or wireless communication links connecting them).
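
As a non-authoritative illustration, the dependency portion of the network infrastructure information 111 could be represented as a simple graph structure; the component names and relationships below are assumptions made for the sketch:

    # Hypothetical representation of network infrastructure information 111:
    # operational dependencies (which components a component calls upon) and
    # communication links (which components can exchange messages).
    operational_deps = {
        "A": ["B", "N"],   # a transaction at A calls upon B and N
        "B": [],
        "N": [],
    }
    communication_links = {("A", "B"), ("B", "N"), ("A", "N")}

    def dependents_of(component):
        """Components whose operations depend on the given component."""
        return [c for c, deps in operational_deps.items() if component in deps]

    print(dependents_of("B"))  # ['A'] -- an error originating at B affects A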

Additionally, the diagnostics server 108 comprises an automatic analyzer module 214 communicating with the data store 116, as will be further described with respect to FIG. 2. The automatic analyzer module 214 receives aggregate health logs 107 for each of the components 104A . . . 104N associated with a particular task or job (e.g. accessing a resource provided by components 104) as well as network infrastructure information 111, and is then configured to determine a root cause of the error characterizing a particular system component (e.g. 104A) which originated an error in the system. The automatic analyzer module 214 may be triggered to detect the source of an error upon monitoring system behaviors and determining that an error has occurred in the network 100. Such a determination may be made by applying a set of monitoring rules 109 via the automatic analyzer module 214 which are based on historical error patterns for the system components 104 and associated traffic patterns, thereby allowing deeper understanding of the error (e.g. API connection error) and the expected operational resolution. In one aspect, the monitoring rules 109 may be used by the automatic analyzer module 214 to map a historical error pattern (e.g. communications between components 104 following a specific traffic pattern, as may be predicted by a machine learning module in FIG. 2) to a specific error type. Additionally, in at least one aspect, the health monitoring rules 109 may indicate data integrity metadata indicating a format and/or content of messages communicated between components 104. In this way, when the messages differ from the data integrity metadata, the automatic analyzer module 214 may indicate (e.g. via a display on the diagnostics server 108 or computing device 102) that the error relates to data integrity deviations.

Additionally, in at least one aspect, the automatic analyzer module 214 may use the network infrastructure information 111 and the monitoring rules 109 (mapping error patterns to additional metadata characterizing the error) to identify the error, its root cause (e.g. via the relationship information in the network infrastructure information 111) and the dependency impact, including other system components 104 affected by the error and having a relationship to the error originating system component.

Thus, in one or more aspects, the network 100 utilizes a holistic approach to health monitoring by providing an automatic analyzer 214 coupled to all of the system components 104 (e.g. APIs) via the network 106 for analyzing the health of the system components 104 as a whole and individually. Notably, when an error occurs in the system (e.g. an API fails to perform an expected function or a timeout occurs), the error may be tracked and its origin located.

In at least one aspect, the health logs 107 are converted to and/or provided in a standardized format (e.g. JSON format) from each of the system components 104. The standardized format may further include a smart log pattern which can reveal functional dependencies between the system components 104, and key metadata which interconnects messages for a particular task or job (e.g. customer identification). The diagnostics server 108 is thus configured to receive the health logs 107 in a standardized format as well as receiving information about the network infrastructure (e.g. relationships and dependencies between the system components) from a data store to determine whether a detected system error follows a specific system error pattern and therefore the dependency impact of the error on related system components 104.
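
For illustration, a single health-log entry in such a standardized JSON format might resemble the following sketch; the field names are assumptions and are not taken from the disclosure's figures:

    import json

    # Illustrative health-log entry in a standardized JSON format; the field
    # names are assumptions rather than the format used in the figures.
    log_entry = json.dumps({
        "timestamp": "2020-07-10T14:03:22.101Z",
        "component": "API-B",
        "traceability_id": "txn-481516.A.B",
        "customer_id": "cust-0042",
        "event": "REQUEST_RECEIVED",
        "detail": "order lookup forwarded from API-A",
    })
    print(json.loads(log_entry)["traceability_id"])  # txn-481516.A.B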

FIG. 2 is a diagram illustrating in schematic form an example computing device (e.g. diagnostics server 108 of FIG. 1), in accordance with one or more aspects of the present disclosure. The diagnostics server 108 facilitates providing a system to perform health monitoring of distributed architecture components (e.g. APIs) as a whole using health logs (e.g. API logs) and network architecture information defining relationships for the distributed architecture components. The system may further capture key metadata (e.g. key identifiers such as a digital identification number of a transaction across an institution among various distributed components) to track messages communicated between the components and facilitate determining the route taken by the message when an error was generated. Preferably, as described herein, the diagnostics server 108 is configured to utilize at least the health logs and the network architecture information to determine a root cause of an error generated in the overall system.

Diagnostics server 108 comprises one or more processors 202, one or more input devices 204, one or more communication units 206 and one or more output devices 208. Diagnostics server 108 also includes one or more storage devices 210 storing one or more modules such as automatic analyzer module 214; data integrity validation module 216; infrastructure validation module 218; machine learning module 220; alert module 222; and a data store 116 for storing data comprising health logs 107, monitoring rules 109, and network infrastructure information 111.

Communication channels 224 may couple each of the components 116, 202, 204, 206, 208, 210, 214, 216 and 218 for inter-component communications, whether communicatively, physically and/or operatively. In some examples, communication channels 224 may include a system bus, a network connection, an inter-process communication data structure, or any other method for communicating data.

One or more processors 202 may implement functionality and/or execute instructions within diagnostics server 108. For example, processors 202 may be configured to receive instructions and/or data from storage devices 210 to execute the functionality of the modules shown in FIG. 2, among others (e.g. operating system, applications, etc.). Diagnostics server 108 may store data/information to storage devices 210 such as health logs 107, monitoring rules 109 and network infrastructure information 111. Some of the functionality is described further below.

One or more communication units 206 may communicate with external devices via one or more networks (e.g. communication network 106) by transmitting and/or receiving network signals on the one or more networks. The communication units may include various antennae and/or network interface cards, etc. for wireless and/or wired communications.

Input and output devices may include any of one or more buttons, switches, pointing devices, cameras, a keyboard, a microphone, one or more sensors (e.g. biometric, etc.), a speaker, a bell, one or more lights, etc. One or more of same may be coupled via a universal serial bus (USB) or other communication channel (e.g. 224).

The one or more storage devices 210 may store instructions and/or data for processing during operation of diagnostics server 108. The one or more storage devices may take different forms and/or configurations, for example, as short-term memory or long-term memory. Storage devices 210 may be configured for short-term storage of information as volatile memory, which does not retain stored contents when power is removed. Volatile memory examples include random access memory (RAM), dynamic random access memory (DRAM), static random access memory (SRAM), etc. Storage devices 210, in some examples, also include one or more computer-readable storage media, for example, to store larger amounts of information than volatile memory and/or to store such information for the long term, retaining information when power is removed. Non-volatile memory examples include magnetic hard discs, optical discs, floppy discs, flash memories, or forms of electrically programmable memory (EPROM) or electrically erasable and programmable (EEPROM) memory.

Referring to FIGS. 1 and 2, automatic analyzer module 214 may comprise an application which monitors communications between system components 104 and monitors for an alert indicating an error in the communications. Upon indication of an alert, the automatic analyzer module 214 receives an input indicating a health log (e.g. 107A . . . 107N) from each of the system components 104, together defining an aggregate health log 107, and the network infrastructure information 111 defining relationships including interdependencies for connectivity and/or operation and/or communication between the system components. Based on this, the automatic analyzer module 214 automatically determines a particular component of the system components originating the error and associated dependent components from the system components affected. In one example, this may include the automatic analyzer module 214 using the standardized format of messages in the health logs 107 to capture key identifiers (e.g. connection identifiers, message identifiers, etc.) linking a particular task to the messages and depicting a route travelled by the messages, and applying the network infrastructure information 111 to the health logs to reveal a source of the error and the dependency impact. In some aspects, the automatic analyzer module 214 further accesses a set of monitoring rules 109 which may associate specific types of messages or traffic flows indicated in the health logs with specific system error patterns and typical dependency impacts (e.g. for a particular type of error X, system components A, B, and C would be affected).
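
One plausible reading of this determination is sketched below, under the assumption that, for a given traceability ID, the originating component is the most downstream failing one (the failing component that calls upon no other failing component):

    # Minimal sketch (assumed logic): among components whose logs record an
    # error for one traceability ID, treat the most downstream one -- the
    # one that calls upon no other failing component -- as the origin.
    failing = {"A", "B"}               # components whose logs show the error
    operational_deps = {"A": ["B"], "B": ["C"], "C": []}

    def find_origin(failing, deps):
        for comp in failing:
            if not (set(deps.get(comp, ())) & failing):
                return comp            # calls no other failing component
        raise ValueError("no unambiguous origin")

    origin = find_origin(failing, operational_deps)
    affected = [c for c, d in operational_deps.items() if origin in d]
    print(origin, affected)            # B ['A'] -- B originated; A affected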

The machine learning module 220 may be configured to track communication flows between components 104 and usage/error patterns of the components 104 from a past time period to the current time period, and help predict the presence of an error and its characteristics. The machine learning module 220 may generate a mapping table between specific error patterns in the messages communicated between the components 104 and corresponding information characterizing the error, including error type, possible dependencies and expected operational resolution. In this way, the machine learning module 220 may utilize machine learning models such as regression techniques or convolutional neural networks, etc. to proactively predict additional error patterns and associated details based on historical usage data. In at least some aspects, the machine learning module 220 cooperates with the automatic analyzer module 214 for proactively determining that an error exists and characterizing the error.
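
As a minimal sketch only, assuming a scikit-learn style decision tree and an assumed feature encoding (counts of messages observed on each link), such a mapping from traffic patterns to error types might be trained as follows:

    from sklearn.tree import DecisionTreeClassifier

    # Sketch only: encode each historical traffic flow as a fixed-length
    # feature vector (here, message counts on links A->B, B->C, A->C) and
    # train a decision tree to predict the error type. The encoding and
    # labels are illustrative assumptions.
    X = [[3, 3, 0],   # flow observed during past timeout errors
         [1, 0, 5],   # flow observed during past data-integrity errors
         [2, 2, 0]]
    y = ["timeout", "data_integrity", "timeout"]

    clf = DecisionTreeClassifier().fit(X, y)
    print(clf.predict([[4, 4, 0]])[0])  # -> 'timeout'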

Data integrity validation module 216 may be configured to retrieve a set of predefined data integrity rules provided in the monitoring rules 109 to determine whether the data in the health logs 107 satisfies the data integrity rules (e.g. format and/or content of messages in the health logs 107).
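
A minimal sketch of such a data integrity check, assuming rules expressed as required fields with value patterns (the rule shapes and field names are illustrative assumptions):

    import re

    # Hypothetical data-integrity rules: required fields and value patterns
    # for messages recorded in a health log.
    integrity_rules = {
        "traceability_id": re.compile(r"^txn-\d+(\.[A-Z]+)*$"),
        "timestamp": re.compile(r"^\d{4}-\d{2}-\d{2}T"),
        "event": re.compile(r"^[A-Z_]+$"),
    }

    def integrity_violations(entry):
        """Return the fields of a log entry that break the integrity rules."""
        return [field for field, pattern in integrity_rules.items()
                if not pattern.match(str(entry.get(field, "")))]

    entry = {"traceability_id": "txn-481516.A.B",
             "timestamp": "2020-07-10T14:03:22Z", "event": "request received"}
    print(integrity_violations(entry))  # ['event'] -- lowercase breaks the rule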

Infrastructure validation module 218 may be configured to retrieve a set of predefined network infrastructure rules (e.g. for a particular task) based on information determined from the health logs 107 and determine whether the data in the network infrastructure information 111 satisfies the predefined rules 109.

Alert module 222 may comprise a user interface located on the server 108, or may control an external user interface (e.g. via the communication units 206), to display the error detected by the server 108 and characterizing information (e.g. the source of the error, dependency impacts, and possible operational solutions) to assist with the resolution of the error. An example of such an alert is shown in FIG. 6.

Referring again to FIG. 2, it is understood that operations may not fall exactly within the modules 214, 216, 218, 220, and 222, such that one module may assist with the functionality of another.

FIG. 3 is a flow chart of operations 300 which are performed by a computing device such as the diagnostics server 108 shown in FIGS. 1 and 2. The computing device may comprise a processor and a communications unit configured to communicate with distributed system application components such as API components to monitor the application health of the system components and to determine the source of an error for subsequent resolution. The computing device (e.g. the diagnostics server 108) is configured to utilize instructions (stored in a non-transient storage device), which when executed by the processor configure the computing device to perform operations such as operations 300.

At 302, operations of the computing device (e.g. diagnostics server 108) track communication between the system components (e.g. components 104) in a distributed system and monitor for an alert indicating an error in the communication in the distributed computer system. In one aspect, monitoring for the alert includes applying monitoring rules to the communication to proactively detect errors in the distributed system by monitoring for the communication between the components matching a specific error pattern. In one aspect, the computing device may further be configured to obtain the monitoring rules which include data integrity information for each of the types of communications between the system components. The monitoring rules may be used to verify whether the health logs comply with the data integrity information (e.g. to determine whether the data being communicated or otherwise transacted is consistent and accurate over the lifecycle of a particular task).

In one aspect, the health monitoring rules may further be defined based on historical error patterns for the communications in the distributed computer system. That is, a pattern set of pre-defined communication traffic flows for messages between the system components, which may occur in each of the health logs, may be mapped to particular error types. Thus, when a defined communication traffic flow is detected, it may be mapped to a particular error pattern, thereby allowing further characterization of the error by error type, including possible resolution.
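
For illustration, such a rule set could be sketched as a lookup table keyed by a pre-defined traffic flow; the flows, error types and resolutions below are assumptions, not content of the disclosure:

    # Sketch of monitoring rules 109 as a table mapping a pre-defined
    # traffic flow (a sequence of component hops) to an error type and a
    # suggested operational resolution.
    monitoring_rules = {
        ("A", "B", "C"): ("API connection error", "restart connector on C"),
        ("A", "B"):      ("timeout at B", "increase B's thread pool size"),
    }

    def classify_flow(hops):
        return monitoring_rules.get(tuple(hops),
                                    ("unrecognized pattern", "manual triage"))

    error_type, resolution = classify_flow(["A", "B", "C"])
    print(error_type, "->", resolution)  # API connection error -> restart ...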

Operations 304-308 of the computing device are triggered in response to detecting the presence of the error. At 304, upon detecting the error, operations of the computing device trigger receiving a health log from each of the system components (e.g. 104A-104N, collectively 104) together defining an aggregate health log. The health logs may be in a standardized format (e.g. JSON format) and utilize common key identifiers (e.g. connection identifier, digital identifier of a transaction, etc.). This allows consistency in the information communicated and tracking of the messages such that it can be used to determine a context of the messages and mapped to capture the key identifiers across the distributed components. In one aspect, the common key identifiers are used by the computing device for tracing a route of the messages communicated between the distributed system components and particularly, for a transaction having the error. Additionally, in one aspect, the health logs may follow a particular log pattern with one or more metadata (e.g. customer identification number, traceability identification number, timestamp, event information, etc.) which allows tracking and identification of messages communicated with the distributed system components. An example of the format of the health logs is shown in FIG. 5.

At 306 and further in response to detecting the error, operations of the computing device (e.g. diagnostics server 108) configure receiving, from a data store of the computing device, network infrastructure information defining one or more relationships for connectivity and communication flow between the system components. The relationships characterize dependency information between the system components. The network infrastructure information may indicate, for example, how the components are connected to one another and, for a set of defined operations, how they are dependent upon and utilize resources of another component in order to perform the defined operation.

At 308, operations of the computing device automatically determine, based at least on the aggregate health log and the network infrastructure information, a particular component of the system components originating the error and associated dependent components affected from the system components.

In a further aspect, automatically determining the origination of the error in a distributed component system includes comparing each of the health logs to the other health logs in response to the relationships in the network infrastructure information, and may include mapping the information to predefined patterns for the logs to determine where the deviations from the expected communications may have occurred.

Referring to FIG. 4, shown is an example scenario for flow of messages between distributed system components located both internal to an organization (e.g. on a private network) and remote to the organization (e.g. outside the private network). FIG. 4 further illustrates monitoring of health of the distributed components including error source origination detection for an error occurring in the message. As shown in FIG. 4, flow of messages may occur between internal system components 104A-104C located on a first computing device (e.g. computing device 102 of FIG. 1) and component 104D of an external computing device (e.g. a second computing device 102′) located outside the institution provided by systems A-C. Other variations of distributions of the system components on computing devices may be envisaged. For example, each system component 104A-104D may reside on distinct computing devices altogether.

The path of a message is shown as travelling across link 401A to 401B to 401C.

Thus, as described above, the automatic analyzer module 214 initially receives a set of API logs (e.g. aggregate health logs 107A-107C characterizing message activity for system components 104A-104D, events and/or errors communicated across links 401A-401C) in a standardized format. The standardized format may be JSON and may include one or more key identifiers that link together the API logs as being related to a task or operation.

FIG. 5 illustrates example API logs 501-503 (a type of health logs 107A-107C) which may be communicated between system components 104 such as system components 104A-104D of FIG. 4. For example, each API log from an API system component 104 would include API event information such as interactions with the API, including calls or requests and their content. The API logs further include a timestamp 504 indicating a time of message and a traceability ID 506 which allows tracking a message path from one API to another (e.g. as shown in API logs 501-503).

For example, a message sent from a first API to a second API would have the same traceability ID (or at least a common portion in the traceability ID 506) with different timestamps 504. As noted above, when an error is detected in the overall system (e.g. error 507 in API log 503), the API logs 501-503 for all of the system components are reviewed at the automatic analyzer module 214. Additionally, the automatic analyzer module 214 receives network infrastructure info 111 metadata which defines relationships between the various API components 104 in the system, including which component systems are dependent on others for each pre-defined type of action (e.g. message communication, performing a particular task, accessing a resource, etc.). Further, the automatic analyzer module 214 may retrieve from a data store 116 a set of health monitoring rules 109 which can define historical error patterns (e.g. an error of type X typically follows a path from API 1 to API 2) to recognize and diagnose errors. For example, the set of health monitoring rules 109 may map a traffic pattern between the API logs (e.g. API logs 501-503) to a particular type of error.
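
A minimal sketch of this correlation step, assuming that logs sharing one traceability ID are ordered by timestamp to reconstruct the path and to locate the log carrying the error (log contents are illustrative):

    # Sketch: sort the API logs sharing one traceability ID by timestamp to
    # reconstruct the message path, then report the component whose log
    # records the error.
    api_logs = [
        {"component": "API-1", "timestamp": "14:03:22.101", "error": None},
        {"component": "API-3", "timestamp": "14:03:22.390", "error": "E507"},
        {"component": "API-2", "timestamp": "14:03:22.250", "error": None},
    ]

    ordered = sorted(api_logs, key=lambda log: log["timestamp"])
    path = " -> ".join(log["component"] for log in ordered)
    erring = next(log["component"] for log in ordered if log["error"])
    print(path)    # API-1 -> API-2 -> API-3
    print(erring)  # API-3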

Thus, referring again to FIGS. 4 and 5, once an error is detected in the overall system (e.g. the error 507), the automatic analyzer module 214 utilizes the aggregate API logs 107A-107C (e.g. received from each of the system components having the same traceability ID), the network infrastructure information 111 and the monitoring rules 109 to determine which of the system components originated the error, characterizations of the error (e.g. based on historical error patterns) and associated dependent components directly affected by the error 507. The disclosed method and system allows diagnosis of health of application data communicated between APIs and locating the errors for subsequent analysis, in one or more aspects.

Subsequent to the above automatic determination of application health by the automatic analyzer module 214, including characterizing the error 507 (e.g. based on monitoring rules 109 characterizing prior error issues and types communicated between system components 104A-104D) along with which component(s) are responsible for the error 507 in the system (e.g. based on digesting the network infrastructure information 111) and associated components, the system may provide the diagnostic results as an alert to a user interface. The user interface may be associated with the automatic analyzer module 214 so that a user (e.g. system support) can see which API(s) are having issues and determine corrective measures. The user interface may display the results either on the diagnostics server 108 or any of the computing devices 102 for further action. This allows the network 100 shown in FIG. 1 to monitor its distributed components 104 and be proactive in providing error notification diagnostics for their systems support. The alert may be an email, a text message, a video message or any other type of visual display as envisaged by a person skilled in the art. In one aspect, the alert may be displayed on a particular device based on the particular component originating the error as determined from the received health logs. In a further aspect, the alert is displayed on the user interface along with metadata characterizing the error, including associated dependent components to the particular component originating the error.

Referring to FIGS. 4-6, an example of the automatic analyzer module 214 generating and sending such an alert 600 to a computing device (e.g. 102, 102′ or 106, etc.) responsible for error resolution in the system component 104 which generated the error is shown in FIG. 6. In the case of FIG. 6, the automatic analyzer module 214 is configured to generate an email to the operations or support team (e.g. provided via a unified messaging platform and accessible via the computing devices in FIG. 1) detailing the error and reasoning for the error for subsequent resolution thereof.

Referring now to FIG. 7, shown is an example flow of messages 700, provided in at least one aspect, shown as Message(1)-Message(3) communicated in the network 100 of FIG. 1 between distributed system components 104A-104C (e.g. web tier(s) and API components) associated with distinct computing devices 102A, 102B, and 102C, collectively referred to as 102. The health of the distributed applications is monitored via health logs 107A-107C (e.g. asyncMessage(1)-asyncMessage(3)) and subsequently analyzed by the diagnostics server 108 via the automatic analyzer module 214 (e.g. also referred to as UDD, unified deep diagnostic analytics). As noted above, the health logs 107 may utilize a standardized JSON format defining a unified smart log pattern (USLP). The unified smart log pattern of the health logs 107 may enable a better understanding of the flow of messages; provide an indication of functional dependencies between the system components; and utilize a linking key metadata that connects messages via a common identifier (e.g. customer ID).

Additionally, as noted above, the automatic analyzer module 214 monitors the health logs and may apply a set of monitoring rules (e.g. monitoring rules 109 in FIG. 1) to detect errors, including the origination source, via pre-defined error patterns shown at step 702 and the expected operational resolution. In at least some aspects, the monitoring rules 109 applied by the automatic analyzer module 214 may include a decision tree or other machine learning trained model which utilizes prior error patterns to predict the error pattern in the current flow of messages 700. The results of the error analysis may be provided to a user interface at step 704, e.g. via another computing device 706, for further resolution. An example of the notification provided at step 704 to the other computer 706 responsible for providing system support and error resolution for the system component which originated the error is shown in FIG. 6. The notification provided at step 704 may be provided via e-mail, short message service (SMS), a graphical user interface (GUI), a dashboard (e.g. a type of GUI providing a high level view of performance indicators), etc.

In one or more examples, the functions described may be implemented in hardware, software, firmware, or any combination thereof. If implemented in software, the functions may be stored on or transmitted over, as one or more instructions or code, a computer-readable medium and executed by a hardware-based processing unit.

Computer-readable media may include computer-readable storage media, which corresponds to a tangible medium such as data storage media, or communication media including any medium that facilitates transfer of a computer program from one place to another, e.g., according to a communication protocol. In this manner, computer-readable media generally may correspond to (1) tangible computer-readable storage media, which is non-transitory, or (2) a communication medium such as a signal or carrier wave. Data storage media may be any available media that can be accessed by one or more computers or one or more processors to retrieve instructions, code and/or data structures for implementation of the techniques described in this disclosure. A computer program product may include a computer-readable medium. By way of example, and not limitation, such computer-readable storage media can comprise RAM, ROM, EEPROM, optical disk storage, magnetic disk storage or other magnetic storage devices, flash memory, or any other medium that can be used to store desired program code in the form of instructions or data structures and that can be accessed by a computer. Also, any connection is properly termed a computer-readable medium. For example, if instructions are transmitted from a website, server, or other remote source using wired or wireless technologies, such technologies are included in the definition of medium. It should be understood, however, that computer-readable storage media and data storage media do not include connections, carrier waves, signals, or other transient media, but are instead directed to non-transient, tangible storage media.

Instructions may be executed by one or more processors, such as one or more general purpose microprocessors, application specific integrated circuits (ASICs), field programmable gate arrays (FPGAs), digital signal processors (DSPs), or other similar integrated or discrete logic circuitry. The term “processor,” as used herein, may refer to any of the foregoing examples or any other suitable structure to implement the described techniques. In addition, in some aspects, the functionality described may be provided within dedicated software modules and/or hardware. Also, the techniques could be fully implemented in one or more circuits or logic elements. The techniques of this disclosure may be implemented in a wide variety of devices or apparatuses, including an integrated circuit (IC) or a set of ICs (e.g., a chip set).

One or more currently preferred embodiments have been described by way of example. It will be apparent to persons skilled in the art that a number of variations and modifications can be made without departing from the scope of the invention as defined in the claims.

What is claimed is:
1. A computing device for monitoring health of a distributed computer system, the computing device having a processor coupled to a memory, the memory storing instructions which when executed by the processor configure the computing device to: track communication between a plurality of interconnected system components of the distributed computer system and monitor for an alert indicating an error in the communication, responsive to detecting the error: receive a health log from each of the system components, each said health log being in a standardized format indicating messages communicated between the system components; capture, from each said health log, common key identifiers for tracing a route of the messages communicated for a transaction having the error; receive network infrastructure information defining relationships for connectivity between the system components, the relationships characterizing dependency information between the system components; and automatically determine, based on applying the network infrastructure information to the health logs including the common key identifiers and further mapping to a set of health monitoring rules comprising data integrity information, a particular component of the system components originating the error and associated dependent components affected.
2. The computing device of claim 1, further comprising obtaining the health monitoring rules from a data store wherein the data integrity information is for pre-defined communications between the system components, the instructions configuring the computing device to apply the set of health monitoring rules for verifying whether each said health log comprising the common key identifiers complies with the data integrity information.
3. The computing device of claim 2, wherein the health monitoring rules are further defined based on historical error patterns, derived from historical health logs, for the distributed computer system associating a set of traffic flows potentially occurring for the messages communicated between the system components as derived from respective common key identifiers in the historical health logs to a corresponding error type for the error pattern.
4. The computing device of claim 1, wherein the common key identifiers, common to the system components communicating in a particular transaction, link a particular task to the messages communicated for that particular task and depict the route of the messages communicated between the system components for the particular task having the error.
5. The computing device of claim 1, wherein the common key identifiers further identify the system components and types of events or messages communicated for each transaction.
6. The computing device of claim 1, wherein the common key identifiers link two or more parties affecting a transaction.
7. The computing device of claim 1, wherein the common key identifiers comprise key metadata that interconnects the system components via an entity function role.
8. The computing device of claim 1, wherein the instructions configure the computing device to modify the common key identifiers each time they are processed or communicated by one of the system components to identify a path taken by the messages.
9. The computing device of claim 1, wherein the instructions further configure the computing device to: determine from the dependency information indicating which of the system components are dependent on one another for operations performed in the distributed computer system, an impact of the error originated by the particular component on the associated dependent components.
10. The computing device of claim 9, wherein the instructions further configure the computing device to perform, upon detecting the alert: displaying the alert on a user interface of a client application for the device, the alert based on the particular component originating the error.
11. The computing device of claim 10, further comprising: displaying on the user interface along with the alert, the associated dependent components to the particular component originating the error.
12. The device of claim 1, wherein the standardized format comprises a JSON format.
13. The device of claim 1, wherein the system components are APIs (application programming interfaces) on one or more connected computing devices and the health log is an API log for logging activity for the respective API in communication with other APIs and related to the error.
14. The device of claim 1, wherein the processor configuring the computing device to automatically determine origin of the error further comprises: comparing each of the health logs in an aggregate health log to the other health logs in response to the relationships in the network infrastructure information.
15. A method implemented by a computing device, the method for monitoring health of a distributed computer system, the method comprising: tracking communication between a plurality of interconnected system components of the distributed computer system and monitoring for an alert indicating an error in the communication, responsive to detecting the error: receiving a health log from each of the system components, each said health log being in a standardized format indicating messages communicated between the system components; capturing, from each said health log, common key identifiers for tracing a route of the messages communicated for a transaction having the error; receiving network infrastructure information defining relationships for connectivity between the system components, the relationships characterizing dependency information between the system components; and automatically determining, based on applying the network infrastructure information to the health logs including the common key identifiers and further mapping to a set of health monitoring rules comprising data integrity information, a particular component of the system components originating the error and associated dependent components affected.
16. The method of claim 15, further comprising obtaining the health monitoring rules from a data store wherein the data integrity information is for pre-defined communications between the system components, the set of health monitoring rules being applied for verifying whether each said health log comprising the common key identifiers complies with the data integrity information.
17. The method of claim 16, wherein the health monitoring rules are further defined based on historical error patterns, derived from historical health logs, for the distributed computer system associating a set of traffic flows potentially occurring for the messages communicated between the system components as derived from respective common key identifiers in the historical health logs to a corresponding error type for the error pattern.
18. The method of claim 15, wherein the common key identifiers, common to the system components communicating in a particular transaction, link a particular task to the messages communicated for that particular task and depict the route of the messages communicated between the system components for the particular task having the error.
19. The method of claim 15, wherein the common key identifiers further identify the system components and types of events or messages communicated for each transaction.
20. The method of claim 19, wherein the common key identifiers link two or more parties affecting a transaction.
21. The method of claim 15, wherein the common key identifiers comprise key metadata that interconnects the system components via an entity function role.
22. The method of claim 15, further comprising updating the common key identifiers each time they are processed or communicated by one of the system components to identify a path taken by the messages.
23. The method of claim 15, further comprising: determining from the dependency information indicating which of the system components are dependent on one another for operations performed in the distributed computer system, an impact of the error originated by the particular component on the associated dependent components.
24. The method of claim 23, further comprising upon detecting the alert: displaying the alert on a user interface of a client application for the device, the alert based on the particular component originating the error.
25. The method of claim 24, further comprising: displaying on the user interface along with the alert, the associated dependent components to the particular component.
26. The method of claim 25, wherein the standardized format comprises a JSON format.
27. The method of claim 15, wherein the system components are APIs (application programming interfaces) on one or more connected computing devices and the health log is an API log for logging activity for the respective API in communication with other APIs and related to the error.
28. The method of claim 15, wherein automatically determining origination of the error further comprises: comparing each of the health logs in an aggregate health log to the other health logs in response to the relationships in the network infrastructure information.
29. A computer readable medium comprising a non-transitory device storing instructions and data, which when executed by a processor of a computing device, the processor coupled to a memory, configure the computing device to: track communication between a plurality of interconnected system components of a distributed computer system and monitor for an alert indicating an error in the communication in the distributed computer system, and upon detecting the error: receive a health log from each of the system components, each said health log being in a standardized format indicating messages communicated between the system components; capture, from each said health log, common key identifiers for tracing a route of the messages communicated for a transaction having the error; receive network infrastructure information defining relationships for connectivity between the system components, the relationships characterizing dependency information between the system components; and automatically determine, based on applying the network infrastructure information to the health logs including the common key identifiers and further mapping to a set of health monitoring rules comprising data integrity information, a particular component of the system components originating the error and associated dependent components affected.